<!DOCTYPE HTML PUBLIC "-//ORA//DTD CD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>[Chapter 2] Lexical Analysis</TITLE>
<META NAME="author" CONTENT="Mark Grand">
<META NAME="date" CONTENT="Thu Jul 31 13:06:17 1997">
<META NAME="form" CONTENT="html">
<META NAME="metadata" CONTENT="dublincore.0.1">
<META NAME="objecttype" CONTENT="book part">
<META NAME="otheragent" CONTENT="gmat dbtohtml">
<META NAME="publisher" CONTENT="O'Reilly &amp; Associates, Inc.">
<META NAME="source" CONTENT="SGML">
<META NAME="subject" CONTENT="java">
<META NAME="title" CONTENT="Java Language Reference">
<META HTTP-EQUIV="Content-Script-Type" CONTENT="text/javascript">
</HEAD>
<body vlink="#551a8b" alink="#ff0000" text="#000000" bgcolor="#FFFFFF" link="#0000ee">

<DIV CLASS=htmlnav>
<H1><a href='index.htm'><IMG SRC="gifs/smbanner.gif"
     ALT="Java Language Reference" border=0></a></H1>
<table width=515 border=0 cellpadding=0 cellspacing=0>
<tr>
<td width=172 align=left valign=top><A HREF="ch01_05.htm"><IMG SRC="gifs/txtpreva.gif" ALT="Previous" border=0></A></td>
<td width=171 align=center valign=top><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1">Chapter 2</FONT></B></TD>
<td width=172 align=right valign=top><A HREF="ch02_02.htm"><IMG SRC="gifs/txtnexta.gif" ALT="Next" border=0></A></td>
</tr>
</table>

&nbsp;
<hr align=left width=515>
</DIV>
<H1 CLASS=chapter><A CLASS="TITLE" NAME="JLR2-CH-2">2. Lexical Analysis</A></H1>

<DIV CLASS=htmltoc>

<p>
<b>Contents:</b><br>
Pre-Processing<br>
<A HREF="ch02_02.htm">Tokenization</A><BR>

<p>
</DIV>

<P CLASS=para>
When
the Java compiler compiles a program, the first thing it does is 
determine the structure of the program. The compiler reads the
characters in the program source and then applies rules to recognize
progressively larger chunks of the file, such as identifiers, expressions,
statements, and classes. The process of discovering the organization
of the program is divided into two components:

<P>
<UL CLASS=itemizedlist>
<li CLASS=listitem>The <I CLASS=emphasis>lexical analyzer</I>. This component looks for sequences of characters
called <I CLASS=emphasis>tokens</I> that form identifiers, literals,
operators, and the like.

<P>
<li CLASS=listitem>The <I CLASS=emphasis>parser</I>. This component is responsible for discovering
higher levels of organization in the sequences of tokens discovered
by lexical analysis.

<P>
</UL>
<P CLASS=para>
This chapter describes the rules
governing the lexical analysis of Java programs. The rules governing
the parsing of Java programs are described over the course of subsequent
chapters.

<P CLASS=para>
The lexical analysis rules for Java can appear
slightly ambiguous, because a character sequence can sometimes be
divided into tokens in more than one way. Where that happens, the
conflict is resolved in favor of the interpretation that matches
the most characters. For example, take the character sequence:

<DIV CLASS=screen>
<P>
<PRE>
+++
</PRE>
</DIV>

<P CLASS=para>
The sequence could be
interpreted as either <tt CLASS=literal>+</tt> followed
by <tt CLASS=literal>++</tt>
or <tt CLASS=literal>++</tt> followed by <tt CLASS=literal>+</tt>;
both <tt CLASS=literal>+</tt> and <tt CLASS=literal>++</tt> are valid tokens.
Because tokenization favors the
longest character match, Java interprets the character sequence
as:

<DIV CLASS=screen>
<P>
<PRE>
++ +
</PRE>
</DIV>

<P CLASS=para>
Because <tt CLASS=literal>++</tt>
is longer than <tt CLASS=literal>+</tt>, Java first recognizes the
token <tt CLASS=literal>++</tt> and then the <tt CLASS=literal>+</tt>.
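
<P CLASS=para>
The longest-match rule can be sketched in a few lines of Java. This is
a minimal illustration, not the compiler's actual lexical analyzer: the
class name <tt CLASS=literal>LongestMatchSketch</tt>, the method
<tt CLASS=literal>tokenize()</tt>, and the three-operator table are all
assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical longest-match scanner for a tiny operator set; not the real compiler. */
final class LongestMatchSketch {
    // Candidate operators, listed longest first so the first hit is the longest match.
    private static final String[] OPERATORS = {"++", "+=", "+"};

    /** Splits input into identifier and operator tokens, always taking the longest operator. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;                          // white space only separates tokens
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                int end = pos + 1;
                while (end < input.length()
                        && Character.isJavaIdentifierPart(input.charAt(end))) {
                    end++;
                }
                tokens.add(input.substring(pos, end));
                pos = end;
                continue;
            }
            String matched = null;
            for (String op : OPERATORS) {
                if (input.startsWith(op, pos)) {
                    matched = op;               // longest candidate that fits here
                    break;
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            tokens.add(matched);
            pos += matched.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("+++"));    // prints [++, +]
        System.out.println(tokenize("+ ++"));   // prints [+, ++]
    }
}
```

<P CLASS=para>
Trying the longer operators first is what makes the scanner prefer
<tt CLASS=literal>++</tt> over <tt CLASS=literal>+</tt>; white space
is the only way to force the shorter token.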

<P CLASS=para>
These
rules can produce undesired results when character sequences are
not separated by white space. For example, the following sequence is ambiguous:

<DIV CLASS=screen>
<P>
<PRE>
x++y
</PRE>
</DIV>

<P CLASS=para>
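The programmer probably intended this sequence to
mean "<tt CLASS=literal>x + (+y)</tt>", but the lexical analyzer
always produces the token sequence "<tt CLASS=literal>x ++ y</tt>".
That token sequence is syntactically incorrect, so the compiler rejects it.
Separating the operators with white space, as in
<tt CLASS=literal>x + +y</tt>, produces the intended tokens.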
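
<P CLASS=para>
The effect of white space on tokenization can be observed in ordinary
Java code. The following sketch uses a hypothetical class,
<tt CLASS=literal>WhitespaceAndTokens</tt>, written only for this
illustration; the values 2 and 3 are arbitrary.

```java
/** Illustrative only: shows how white space changes the tokens the compiler sees. */
final class WhitespaceAndTokens {
    /** Returns {a, b, x} after evaluating two similar-looking expressions. */
    static int[] demo() {
        int x = 2;
        int y = 3;
        int a = x + +y;   // tokens: x + + y, binary plus applied to unary plus; x stays 2
        int b = x+++y;    // tokens: x ++ + y by longest match; (x++) + y, then x becomes 3
        // int c = x++y;  // tokens: x ++ y, which the compiler rejects
        return new int[] {a, b, x};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);   // prints 5 5 3
    }
}
```

<P CLASS=para>
Both expressions evaluate to 5, but only the second changes
<tt CLASS=literal>x</tt>, which shows that the lexical analyzer
grouped the characters differently in each case.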

<P CLASS=para>
Java lexical analysis consists of two phases: pre-processing and tokenization.
The pre-processing phase is discussed in the following section.
The tokenization phase is responsible for recognizing the tokens
in the pre-processed input and is discussed later in this chapter.

<DIV CLASS=sect1>
<h2 CLASS=sect1><A CLASS="TITLE" NAME="JLR2-CH-2-SECT-1">2.1 Pre-Processing</A></h2>

<P CLASS=para>
<A NAME="CH02.PREPRO1"></A><A NAME="CH02.PREPRO2"></A>A Java program is a sequence of characters. These characters
are represented using 16-bit numeric codes defined by the Unicode
standard.[1]
Unicode is a character encoding standard that includes representations
for all of the characters needed to write all major natural languages,
as well as special symbols for mathematics. Unicode defines the
codes 0 through 127 to be consistent with ASCII. Because
of that consistency, Java programs can be written in ASCII without
any need for programmers to be aware of Unicode.

<blockquote class=footnote>
<P CLASS=para>[1] 
Unicode is defined by an organization called
the Unicode Consortium. The
defining document for Unicode is <I CLASS=emphasis>The Unicode Standard, 
Version 2.0</I> (published by Addison-Wesley,
ISBN 0-201-48345-9). More recent
information about Unicode is 
available at <I class=emphasis>http://unicode.org/</I>.
</blockquote>
<P CLASS=para>
Java
is based on Unicode to allow Java programs to be useful in as many
parts of the world as possible. Internally, Java programs store
characters as 16-bit Unicode characters. The benefits of using Unicode
are currently difficult to realize, however, because most operating
environments do not support Unicode. And those environments that
do support Unicode generally do not include fonts that cover more
than a small subset of the Unicode character set.

<P CLASS=para>
Since
most operating environments do not support Unicode, Java uses a
pre-processing phase to make sure that all of the characters of
a program are in Unicode. This pre-processing comprises two steps:

<P>
<UL CLASS=itemizedlist>
<li CLASS=listitem>Translate the program source into Unicode characters if it
is in an encoding other than Unicode. Java defines escape
sequences that allow all characters that can be represented in Unicode
to be represented in other character encodings, such as ASCII or
EBCDIC. The escape sequences are recognized by the compiler, even
if the program is already represented in Unicode.

<P>
<li CLASS=listitem>Divide the stream of Unicode characters into lines.

<P>
</UL>
<DIV CLASS=sect2>
<h3 CLASS=sect2><A CLASS="TITLE" NAME="JLR2-CH-2-SECT-1.1">Conversion to Unicode</A></h3>

<P CLASS=para>
<A NAME="CH02.UNICODE1"></A><A NAME="CH02.UNICODE2"></A>The first thing a Java compiler does is translate its input from the source
character encoding (e.g., ASCII or EBCDIC) into Unicode.
During the conversion process, Java translates escape sequences
of the form <tt CLASS=literal>\u</tt> followed by four hexadecimal digits
into the Unicode characters indicated by the given hexadecimal values.
These escape sequences let you represent Unicode characters in
whatever character set you are using for your
source code, even if it is not Unicode. For example,
<tt CLASS=literal>\u0000</tt> is a way of representing the NUL character.
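
<P CLASS=para>
Because the escapes are translated before tokenization, they can appear
anywhere in a source file, not only inside character and string literals.
The following sketch shows this; the class name
<tt CLASS=literal>UnicodeEscapeDemo</tt> and its members are
hypothetical names chosen for the illustration.

```java
/** Illustrative only: Unicode escapes are translated before the source is tokenized. */
final class UnicodeEscapeDemo {
    // The escape for the letter a (hex 0061) is translated first,
    // so this line declares a field named abc.
    static final int \u0061bc = 42;

    static char capitalA() {
        return '\u0041';        // the same character as 'A'
    }

    static String cafe() {
        return "caf\u00E9";     // four characters; the last is e with an acute accent
    }

    public static void main(String[] args) {
        System.out.println(abc);                  // prints 42
        System.out.println(capitalA() == 'A');    // prints true
        System.out.println(cafe().length());      // prints 4
    }
}
```

<P CLASS=para>
The source file itself contains only ASCII characters, yet the compiled
program works with the full Unicode values.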

<P CLASS=para>
More formally, the compiler input is converted from a stream
of <I CLASS=emphasis>EscapedSourceCharacters</I> into a stream of
Unicode characters. <I CLASS=emphasis>EscapedSourceCharacter</I>
is defined as:


<p>
<img align=middle src="./figs/jlr0201.gif" alt="[Graphic: Figure from the text]" width=424 height=67 border=0>

<P CLASS=para>
<I CLASS=emphasis>HexDigit</I>
is either a <I CLASS=emphasis>Digit</I> or one of the following letters:
<tt CLASS=literal>A</tt>, <tt CLASS=literal>a</tt>, <tt CLASS=literal>B</tt>,
<tt CLASS=literal>b</tt>, <tt CLASS=literal>C</tt>, <tt CLASS=literal>c</tt>,
<tt CLASS=literal>D</tt>, <tt CLASS=literal>d</tt>, <tt CLASS=literal>E</tt>,
<tt CLASS=literal>e</tt>, <tt CLASS=literal>F</tt>, or <tt CLASS=literal>f</tt>.

<P CLASS=para>
A <I CLASS=emphasis>Digit</I>
is one of the following characters:
<tt CLASS=literal>0</tt>, <tt CLASS=literal>1</tt>,
<tt CLASS=literal>2</tt>, <tt CLASS=literal>3</tt>, <tt CLASS=literal>4</tt>,
<tt CLASS=literal>5</tt>, <tt CLASS=literal>6</tt>, <tt CLASS=literal>7</tt>,
<tt CLASS=literal>8</tt>, or <tt CLASS=literal>9</tt>.

<P CLASS=para>
As you
can see, the definition of <I CLASS=emphasis>EscapedSourceCharacter</I>
specifies that the `<tt CLASS=literal>u</tt>' in the escape sequence
can occur multiple times. Multiple occurrences have the same meaning
as a single occurrence of `<tt CLASS=literal>u</tt>'.
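
<P CLASS=para>
The translation step, including the handling of repeated
`<tt CLASS=literal>u</tt>' characters, might be sketched as follows.
This is a simplified illustration under stated assumptions, not the
compiler's implementation: the names
<tt CLASS=literal>EscapeTranslatorSketch</tt>,
<tt CLASS=literal>translate()</tt>, and
<tt CLASS=literal>hexValue()</tt> are hypothetical, and the sketch
deliberately omits the Java rule that a backslash immediately preceded
by another backslash does not begin an escape.

```java
/** Hypothetical sketch of the escape-translation step; not the compiler's actual code. */
final class EscapeTranslatorSketch {
    /** Returns the value of an ASCII hexadecimal digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Replaces a backslash, one or more u characters, and four hex digits with one char. */
    static String translate(String source) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            boolean startsEscape = c == '\\'
                    && i + 1 < source.length()
                    && source.charAt(i + 1) == 'u';
            if (!startsEscape) {
                out.append(c);      // simplification: preceding backslashes are not counted
                i++;
                continue;
            }
            int j = i + 1;
            while (j < source.length() && source.charAt(j) == 'u') {
                j++;                // any number of u characters means the same as one
            }
            if (j + 4 > source.length()) {
                throw new IllegalArgumentException("Incomplete escape at index " + i);
            }
            int code = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexValue(source.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("Bad hex digit at index " + k);
                }
                code = code * 16 + digit;
            }
            out.append((char) code);
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the compiler from translating these escapes itself.
        System.out.println(translate("\\u0041"));     // prints A
        System.out.println(translate("\\uuu0041"));   // prints A
    }
}
```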

<P CLASS=para>
If the program source is already
in Unicode, this conversion step is still performed in order
to process these <tt CLASS=literal>\u</tt> escapes.

<P CLASS=para>
The
Java language specification recommends, but does not require, that
the classes that come with Java use
the <tt CLASS=literal>\u</tt><I CLASS=emphasis>xxxx</I>
escapes when called upon to display a character that would not otherwise
be displayable.
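
<P CLASS=para>
A class that follows this recommendation might convert text along the
lines of the sketch below. The class name
<tt CLASS=literal>EscapeForDisplaySketch</tt> and the method
<tt CLASS=literal>toAscii()</tt> are hypothetical, and treating
everything outside printable ASCII as "not displayable" is an
assumption made for the example; a real class would decide that based
on its output environment.

```java
/** Hypothetical helper: writes characters that cannot be shown as four-digit escapes. */
final class EscapeForDisplaySketch {
    /** Returns text with every character outside printable ASCII replaced by an escape. */
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x20 && c <= 0x7E) {
                out.append(c);                                   // assumed displayable
            } else {
                // Backslash, the letter u, then four hexadecimal digits.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("caf\u00E9"));
    }
}
```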

</DIV>

<DIV CLASS=sect2>
<h3 CLASS=sect2><A CLASS="TITLE" NAME="JLR2-CH-2-SECT-1.2">Division of the Input Stream into Lines</A></h3>

<P CLASS=para>
The
second step of pre-processing is responsible for recognizing sequences
of characters that terminate lines. The character sequence that indicates
the end of a line varies with the operating environment. By recognizing
end-of-line character sequences during pre-processing, Java makes
sure that subsequent compilation steps do not need to be concerned
with multiple representations for the end of a line.

<P CLASS=para>
In
this step, the lexical analyzer recognizes the combinations of carriage
return (<tt CLASS=literal>\u000D</tt>) and line feed (<tt CLASS=literal>\u000A</tt>)
characters that are in widespread use as end-of-line indicators:


<p>
<img align=middle src="./figs/jlr0202.gif" alt="[Graphic: Figure from the text]" width=424 height=73 border=0>

<P CLASS=para>
As always, ambiguities in lexical rules are resolved by matching the
longest possible sequence of characters. That means that the sequence
of a carriage return character followed by a line feed character
is always recognized as a single line terminator, never as two.
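
<P CLASS=para>
Line division with this longest-match behavior can be sketched as
follows. The sketch assumes that the recognized terminators are a
carriage return followed by a line feed, a carriage return alone, and
a line feed alone; the class name
<tt CLASS=literal>LineTerminatorSketch</tt> and the method
<tt CLASS=literal>splitLines()</tt> are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of dividing input into lines; not the compiler's actual code. */
final class LineTerminatorSketch {
    /** Splits input at CR LF, CR, or LF, treating CR LF as a single terminator. */
    static List<String> splitLines(String input) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\r') {
                // Longest match: a CR immediately followed by LF is one terminator.
                if (i + 1 < input.length() && input.charAt(i + 1) == '\n') {
                    i++;
                }
                lines.add(current.toString());
                current.setLength(0);
            } else if (c == '\n') {
                lines.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
            i++;
        }
        if (current.length() > 0) {
            lines.add(current.toString());   // final line with no terminator
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\rc\nd"));   // prints [a, b, c, d]
    }
}
```

<P CLASS=para>
If the carriage return and line feed were treated separately, the
input above would contain an extra empty line between
<tt CLASS=literal>a</tt> and <tt CLASS=literal>b</tt>.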

</DIV>

</DIV>



</BODY>
</HTML>
